You see what I want you to see: poisoning vulnerabilities in neural code search

Wan, Yao; Zhang, Shijie; Zhang, Hongyu; Sui, Yulei; Xu, Guandong; Yao, Dezhong; Jin, Hai; Sun, Lichao

Title: You see what I want you to see: poisoning vulnerabilities in neural code search
Creator: Wan, Yao; Zhang, Shijie; Zhang, Hongyu; Sui, Yulei; Xu, Guandong; Yao, Dezhong; Jin, Hai; Sun, Lichao
Relation: ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Singapore, 14-18 November 2022), pp. 1233-1245
Publisher Link: http://dx.doi.org/10.1145/3540250.3549153
Publisher: ACM
Resource Type: conference paper
Date: 2022
Description: Searching and reusing code snippets from open-source software repositories based on natural-language queries can greatly improve programming productivity. Recently, deep-learning-based approaches have become increasingly popular for code search. Despite substantial progress in training accurate code search models, the robustness of these models has received little attention so far. In this paper, we aim to study and understand the security and robustness of code search models by answering the following questions: Can we inject backdoors into deep-learning-based code search models? If so, can we detect poisoned data and remove these backdoors? This work studies and develops a series of backdoor attacks on deep-learning-based models for code search, through data poisoning. We first show that existing models are vulnerable to data-poisoning-based backdoor attacks. We then introduce a simple yet effective attack on neural code search models by poisoning their corresponding training dataset. Moreover, we demonstrate that attacks can also influence the ranking of the code search results by adding a few specially crafted source code files to the training corpus. We show that this type of backdoor attack is effective for several representative deep-learning-based code search systems, and can successfully manipulate the ranked list of search results. Taking the bidirectional RNN-based code search system as an example, the normalized ranking of the target candidate can be significantly raised from top 50% to top 4.43%, given a query containing an attacker-targeted word, e.g., "file". To defend a model against such attacks, we empirically examine an existing popular defense strategy and evaluate its performance. Our results show that the explored defense strategy is not yet effective against our proposed backdoor attack for code search systems.
Subject: code search; software vulnerability; deep learning; backdoor attack; data poisoning
Identifier: http://hdl.handle.net/1959.13/1493927
Identifier: uon:53669
Identifier: ISBN:9781450394130
Language: eng
Reviewed
